The research proceeded in four general areas:

(A)  Problems  of  inference  from  data  accumulating  at  the  random  times  new 
species  are  discovered  were  studied.  Procedures  for  estimating  a variety 
of  characteristics  of  the  population  and  for  predicting  the  random 
probability  that  a next  search  will  uncover  a new  species  were  developed. 

A death process was interposed, and estimates of the size of a population
and the mean life of its members were obtained.

(B) Results for a one-armed bandit in the presence of concomitant
information  were  developed.  These  results  apply  to  the  sequential 
allocation  of  treatments  in  medical  trials. 

(C) Optimal and adaptive stopping based on the maximum of a sequence of
dependent observations were studied. The results are likely to have
application to the choice of components in redundant systems and to
adaptive quality control.

(D) Methods have been derived to estimate the true significance level
resulting  from  repeated  significance  tests. 
SUMMARY OF RESULTS


A. A population consisting of distinct species is searched by selecting
one member at a time. Each time a new species is discovered we receive
an incremental reward, which in statistical applications we regard as
a data point to be used at the termination of the search in a
decision concerning some characteristic of the population. Thus, data
accumulate at the random times at which we discover new species.

The search may continue indefinitely, but there is a cost $c > 0$
associated with each selection. If after $n$ selections we have
discovered $d(n)$ species and elect to terminate the search, our payoff is

\[ h(d(n)) - cn , \]

where $h$ is increasing. Subject to mild conditions on $h$, it is
optimal to terminate the search at the random time

\[ \sigma = \text{first } n > 0 \text{ such that }
   \bigl[h(d(n)+1) - h(d(n))\bigr]\, u(n) < c , \]

where $u(n)$ is the conditional probability that we will discover a new
species at stage $n+1$ of the search, given the results of the search
up to stage $n$.
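
The stopping rule above can be sketched in a few lines for the case in
which the composition of the population is known. The sampling model is
an assumption made for illustration: each selection is taken to land
independently on species i with probability probs[i], so that u(n) is the
total probability of the species not yet discovered. The names
search_until_stop, probs, and rng are hypothetical.

```python
import random


def search_until_stop(probs, h, c, rng):
    """Simulate one search; return (n, d, payoff) at the stopping time.

    Assumption: selections are independent and land on species i with
    probability probs[i], so u(n) is the undiscovered probability mass.
    """
    species = list(range(len(probs)))
    found = set()
    n = 0
    while True:
        # Select one member of the population (each selection costs c).
        found.add(rng.choices(species, weights=probs)[0])
        n += 1
        d = len(found)
        # u(n): chance that selection n + 1 uncovers a new species.
        u = sum(p for i, p in enumerate(probs) if i not in found)
        # Stop the first time the expected one-step gain falls below c.
        if (h(d + 1) - h(d)) * u < c:
            return n, d, h(d) - c * n


if __name__ == "__main__":
    rng = random.Random(0)
    print(search_until_stop([0.4, 0.3, 0.2, 0.1], lambda d: d, 0.15, rng))
```

With h(d) = d the rule reduces to stopping as soon as the undiscovered
probability mass drops below c. The loop always terminates, because u(n)
is zero once every species has been found.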

As an example of a statistical application of this result, suppose
that a specified character is either present in all members of a given
species, or else absent. Our objective is to estimate the proportion
$\theta$ of species carrying the character. If we terminate the search at
stage $n$ we must report an estimate $\hat\theta_n$ of $\theta$, and incur
the normalized loss

\[ L(n) = \frac{(\hat\theta_n - \theta)^2}{\theta(1 - \theta)} + cn , \]

where $c$ is the cost of each stage of the search. The Bayes strategy
with respect to a uniform $(0,1)$ prior on $\theta$ is to terminate the
search at time

\[ s = \text{first } n \ge 0 \text{ such that }
   u(n) < c\, d(n)\,(d(n) + 1) \]

and to estimate $\theta$ with $\hat\theta_s$ = the proportion of those
species discovered up to time $s$ which carry the character. This strategy
is also minimax.
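
One way to see how this rule relates to the general stopping rule is
sketched below. The sketch rests on an assumption not stated in the text:
given that $d \ge 1$ species have been discovered, the number $K$ of them
carrying the character is taken to be Binomial$(d, \theta)$. It is a
heuristic check, not the argument of the cited references.

```latex
% Risk of the proportion K/d under the normalized squared-error loss
\[
E_\theta\!\left[\frac{(K/d - \theta)^2}{\theta(1-\theta)}\right]
  = \frac{\theta(1-\theta)/d}{\theta(1-\theta)}
  = \frac{1}{d}
  \qquad \text{for every } \theta \in (0,1).
\]
% So the estimation part of the payoff behaves like h(d) = -1/d, and
\[
\bigl[h(d+1) - h(d)\bigr]\, u(n)
  = \frac{u(n)}{d\,(d+1)} < c
  \iff
  u(n) < c\, d\,(d+1).
\]
```

The estimation risk $1/d$ does not depend on $\theta$, which is the usual
signature of a minimax procedure.
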

Our results have potential application (for example) to studies of
literary vocabulary or codes, species of animals, insects, or microbes
in a given area, personality types, and levels of job performance. The
obvious limitation in the applicability is that if the composition of
the population is unknown, then the random probability u(n) that a
new species will be discovered at the next stage of the search is
unobservable, and the optimal strategies are unusable. To remedy this
deficiency we undertook a detailed study of a class of observable
predictors of u(n) based on the frequency of frequencies of distinct
species that have been discovered in the search, and of the performance
of adaptive strategies which result from "estimating" u(n) with these
predictors. We obtained a unique class of linear predictors which
were optimal for "estimating" u(n) in a commonly accepted statistical
sense and which led to adaptive strategies which appear (from simulations)
to perform well against the optimal strategy when the composition of the
underlying population is unknown.
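
As an illustration of a predictor built from the frequency of
frequencies, the sketch below uses the classical estimate n_1/n (the
number of species seen exactly once, divided by the number of
selections), which is linear in the frequencies of frequencies. This
particular choice is an assumption made for illustration; the class of
predictors studied in the cited references is not spelled out here. The
function names are hypothetical.

```python
from collections import Counter


def frequency_of_frequencies(observations):
    """Map k -> number of distinct species observed exactly k times."""
    species_counts = Counter(observations)
    return Counter(species_counts.values())


def predict_u(observations):
    """Predict u(n) by n_1 / n, the share of selections that are singletons.

    Illustrative choice (classical Good-Turing form), not necessarily the
    predictor studied in the source.
    """
    n = len(observations)
    if n == 0:
        return 1.0  # nothing seen yet, so the next selection must be new
    return frequency_of_frequencies(observations)[1] / n


def adaptive_stop(observations, h, c):
    """First n >= 1 at which the plug-in rule says stop, else None."""
    for n in range(1, len(observations) + 1):
        seen = observations[:n]
        d = len(set(seen))
        # Same rule as with known composition, with u(n) replaced by its predictor.
        if (h(d + 1) - h(d)) * predict_u(seen) < c:
            return n
    return None
```

The adaptive rule has the same form as the optimal rule for a known
population; only the unobservable u(n) is replaced by a quantity computed
from the search record.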

In a dissertation (in preparation), Mr. Carlos Lima has extended
this study by imposing a death process on the model. Suppose that a
population of animals is trapped at given intervals of time in the wild,
and that each animal has an associated trap probability (trappability).
Of interest are the initial number of animals in the population, the
number of species, and their average lifetimes. Standard estimates
(maximum likelihood) are biased because of the difficulty of estimating
the proportion of the population which is never seen (either because of
premature death or small trappability). Predictors of the type we have
studied apparently lead to estimators which reduce this bias
significantly.

References:  [2],  [4],  and  [7] 


B. For definiteness, we describe this problem in the language of
medical trials; other applications are mentioned later.

We suppose that patients arrive sequentially and may be treated
with either a standard treatment (S) or a competing experimental
treatment (E). Before deciding which treatment to administer to a particular
patient, we observe an associated vector $X$ of relevant concomitant
variables; for example, $X$ = (age, sex, severity of disease, time since
onset, etc.). Given that $X = x$, let $S(x)$ denote the patient's
response if we administer S and $E(x)$ the response if we administer E.
It is assumed that the statistical properties of $S(x)$ (the patient's
response to the standard treatment) are known. For the $j$th patient let

\[ Y_j = \begin{cases}
     S(X_j) & \text{if patient } j \text{ receives S,} \\
     E(X_j) & \text{if patient } j \text{ receives E.}
   \end{cases} \]

Our objective is to administer S and E sequentially in such a
manner that the average value of

\[ \sum_{j=1}^{\infty} \alpha^{j}\, Y_j \]

will be a maximum, where $0 < \alpha < 1$. We have adopted a Bayesian
approach and developed asymptotically optimal solutions as $\alpha \to 1$.
Our results have the following surprising implications.

1. The approximate solution to the allocation problem is simpler in the
presence of the concomitant information $X$ than in its absence;

2. The myopic procedure is asymptotically optimal;

3. The role played by the number $N$ of future patients in earlier work
is played by the discount factor $\alpha$ in ours. The exact value of
the discount factor $\alpha$ (equivalently, knowledge of $N$) is of less
importance in the presence of concomitant information than in earlier
work which did not take account of this information.
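
A toy version of a myopic procedure is sketched below. Every modelling
choice in it is an assumption made for illustration and is not taken from
[9]: a scalar covariate x drawn uniformly from (-1, 1), a standard
response S(x) with known mean 0, an experimental response
E(x) = theta * x + noise with unit-variance normal noise, a standard
normal prior on theta, and an infinite sum truncated at a finite
horizon. The myopic rule gives E whenever its current posterior mean
response is at least that of S.

```python
import random


def myopic_allocation(theta, alpha, horizon, rng):
    """Return (discounted payoff, number of patients given E).

    Toy model (assumed): S(x) ~ N(0, 1), E(x) ~ N(theta * x, 1),
    prior theta ~ N(0, 1), covariate x ~ Uniform(-1, 1).
    """
    post_prec = 1.0   # posterior precision of theta (prior precision 1)
    post_sum = 0.0    # running sum of x * y over patients given E
    total = 0.0
    n_exp = 0
    for j in range(1, horizon + 1):
        x = rng.uniform(-1.0, 1.0)
        post_mean = post_sum / post_prec
        if post_mean * x >= 0.0:
            # Myopic choice: E looks at least as good as S for this patient.
            y = theta * x + rng.gauss(0.0, 1.0)
            post_sum += x * y
            post_prec += x * x
            n_exp += 1
        else:
            y = rng.gauss(0.0, 1.0)  # response to S, known mean 0
        total += alpha ** j * y      # discounted sum, truncated at horizon
    return total, n_exp
```

In this toy model the sign of x varies from patient to patient, so the
myopic rule keeps assigning E to some patients and keeps learning about
theta without any deliberate experimentation.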


Reference:  [9] 


C. Suppose we may consecutively observe variables $x_1, x_2, \ldots$. For
example, each $x$ might represent the quality of a fabricated component
or the response to a training session.

(i) Let $m_n$ denote the largest $x$-value observed after $n$ trials.
If sampling is terminated with $n$ trials, and $m_n = y$ say, we
receive a payoff $f(n,y)$ which we suppose is increasing in $y$
and decreasing in $n$. (For example, if there is a cost $c > 0$
for manufacturing a component, and we choose to use the $n$th
component that we manufacture in a system, then $f(n,y) = y - cn$ is
an appropriate payoff function.) Our objective in this research
has been to describe, in a form usable in practice, the time $\sigma$ for
which the average value of $f(\sigma, m_\sigma)$ will be a maximum. Our
results apply to a special form of dependence among the $x$ values
(corresponding to sampling from an urn without replacement). We prove that

\[ \sigma = \text{first } n \ge 1 \text{ such that } m_n > \beta_n \]

is optimal (conditions on $f$ are assumed), and provide a simple
algorithm for computing the $\beta_n$-values. We show also that the
myopic strategy

\[ \sigma' = \text{first } n \ge 1 \text{ such that } m_n > \beta \]

is asymptotically optimal (where $\beta$ is easily computable).
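
For a small urn whose contents are known, the optimal rule can be found
by brute-force backward induction, as sketched below for the payoff
f(n, y) = y - cn. This is a generic dynamic-programming illustration,
not the algorithm for the $\beta_n$-values referred to above; the
assumption of known, distinct urn values and the name backward_induction
are hypothetical.

```python
def backward_induction(values, c):
    """Optimal stopping for payoff y - c*n when drawing without replacement.

    values: the urn's contents, assumed known, distinct, sorted ascending.
    State (n, k): n draws made, current maximum is the k-th smallest value.
    Returns (V, stop): V[n][k] is the optimal expected payoff from the
    state, and stop[n][k] says whether stopping there is optimal.
    """
    N = len(values)
    V = [[None] * (N + 1) for _ in range(N + 1)]
    stop = [[None] * (N + 1) for _ in range(N + 1)]
    for n in range(N, 0, -1):
        for k in range(n, N + 1):  # the maximum of n draws has rank >= n
            stop_now = values[k - 1] - c * n
            if n == N:
                cont = float("-inf")  # urn exhausted: must stop
            else:
                # k - n undrawn values lie below the maximum, N - k above it.
                below = (k - n) * V[n + 1][k] if k > n else 0.0
                above = sum(V[n + 1][j] for j in range(k + 1, N + 1))
                cont = (below + above) / (N - n)
            V[n][k] = max(stop_now, cont)
            stop[n][k] = stop_now >= cont
    return V, stop
```

For each n, the states with stop[n][k] true play the role of the event
that m_n exceeds the threshold for stage n.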
A student, Dr. Tony Tai, has considered a similar problem
in the context of a modified Ehrenfest model (an elastic model).
He has derived the optimal stopping time, and has also studied in
detail the problem of estimating, by likelihood methods, the size
of the population from which the model derives.

References:  [3]  and  [8] 

(ii) Suppose $x_1, x_2, \ldots$ are i.i.d. random variables with absolutely
continuous distribution $F$. Then, by the probability integral
transformation, $y_i = F(x_i)$, $i = 1, 2, \ldots$, are i.i.d. Uniform $(0,1)$.
Suppose that $F = F_\theta$ is known only up to a vector $\theta$ of unknown
parameters.

Theorem. If $\hat\theta_i = \hat\theta_i(x_1, \ldots, x_i)$ is a complete
sufficient statistic and $\hat\theta_i^{-1} x_i$ is ancillary ($\theta$
denotes a group transformation acting on $x$), then the sequence

\[ \hat y_i = F_{\hat\theta_i}(x_i), \qquad i = 1, 2, \ldots \]

are independent with distribution independent of $\theta$ (and in
general easily derivable).
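
A familiar special case may help fix ideas; it is offered as an
illustration under stated assumptions, not as the construction used in
[6]. Take $F_\theta$ to be the normal distribution with unknown mean
$\theta$ and unit variance, let $\hat\theta_i$ be the running mean of
$x_1, \ldots, x_i$, and let the group be translations, so that
$\hat\theta_i^{-1} x_i$ is the residual $x_i$ minus the running mean.
These residuals are independent normal variables with mean 0 and
variance $1 - 1/i$, so rescaling them and applying the standard normal
distribution function gives independent Uniform $(0,1)$ values free of
$\theta$. The function names are hypothetical.

```python
import math


def std_normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def parameter_free_uniforms(xs):
    """Return u_2, ..., u_n for N(theta, 1) data with theta unknown.

    u_i is the standard normal c.d.f. of the residual x_i - mean(x_1..x_i)
    divided by its standard deviation sqrt(1 - 1/i).
    """
    out = []
    running_sum = 0.0
    for i, x in enumerate(xs, start=1):
        running_sum += x
        if i == 1:
            continue  # the first residual is identically zero
        resid = x - running_sum / i
        z = resid / math.sqrt(1.0 - 1.0 / i)
        out.append(std_normal_cdf(z))
    return out
```

Adding the same constant to every observation leaves the output
unchanged, which is the sense in which the transformed sequence does not
depend on the unknown mean.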

We expect this result to be useful for constructing new
quality control plans (when, for example, the target value $\theta$ is
unknown) and for procedures to detect the time at which a shift
(in the mean or variability) of a process may have occurred.

Reference: [6]
D. Let $X_1, X_2, \ldots$ denote independent random variables whose
densities $f_\omega$, $\omega \in \Omega$, constitute a natural exponential
family, and consider testing an hypothesis of the form
$H_0\colon \omega \in \Omega_0$, where $\Omega_0$ is a lower dimensional
submanifold of the natural parameter space $\Omega$. If a sample of
predetermined size $n$ is taken, then the null distribution
of the likelihood ratio statistic

\[ \Lambda_n = \Bigl(\sup_{\omega \in \Omega} - \sup_{\omega \in \Omega_0}\Bigr)
   \sum_{j=1}^{n} \log f_\omega(X_j) \]

is approximately that of $\tfrac{1}{2}\chi_r^2$, where $r$ is the
codimension of $\Omega_0$ in $\Omega$. However, if the sample size is
allowed to depend on the data in a non-anticipating manner, then the
chi-square approximation to the distribution of $\Lambda_n$ may be poor.
In particular, if $s$ is a stopping time, the chi-square approximation
may badly underestimate the attained significance levels
$P\{\Lambda_s > a\}$, where $\omega \in \Omega_0$ and $a > 0$. If we
restrict attention to stopping times which are bounded by a fixed
number $N$, then the latter probability is maximized by the stopping
time

\[ t = \inf\{n \ge m : \Lambda_n > a \text{ or } n \ge N\} , \]

where $m$ is an initial sample size (so chosen that $\Lambda_n$ is finite
for all $n \ge m$).

We have obtained approximations to $P\{\Lambda_t > a\}$ for
$\omega \in \Omega_0$ and large $a$ in a context which is sufficiently
general to include many multivariate applications. Our results show that
the actual attained significance level may exceed the nominal
$P\{\chi_r^2 > 2a\}$ by a factor of 10 to 20 for $50 \le N \le 300$ when
$r$ is small. These results confirm and extend the numerical work of
Armitage et al. and Siegmund's treatment of repeated t-tests.
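
The inflation of the significance level can be checked by simulation in
the simplest case. The sketch below assumes unit-variance normal
observations with the null hypothesis that the mean is zero, so that
r = 1 and the likelihood ratio statistic is $S_n^2 / (2n)$, where $S_n$
is the sum of the first n observations. It is a Monte Carlo illustration
only, not the analytic approximation described above; the function names
are hypothetical.

```python
import math
import random


def attained_level(a, m, N, reps, rng):
    """Monte Carlo estimate of P{Lambda_t > a} under the null hypothesis.

    Assumed model: X_j ~ N(0, 1) under H0, Lambda_n = S_n**2 / (2 * n),
    t = inf{n >= m : Lambda_n > a or n >= N}.
    """
    hits = 0
    for _ in range(reps):
        s = 0.0
        for n in range(1, N + 1):
            s += rng.gauss(0.0, 1.0)
            # Lambda_t > a exactly when some n in [m, N] crosses the boundary.
            if n >= m and s * s / (2.0 * n) > a:
                hits += 1
                break
    return hits / reps


def nominal_level(a):
    """P{chi-square(1) > 2a} = P{|Z| > sqrt(2a)} for standard normal Z."""
    return math.erfc(math.sqrt(a))


if __name__ == "__main__":
    rng = random.Random(1)
    a = 1.92  # nominal level close to 0.05
    print(nominal_level(a), attained_level(a, 1, 100, 2000, rng))
```

Setting N = m = 1 recovers a single fixed-sample test, for which the
simulated and nominal levels should roughly agree.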

References:  [1],  [5],  [10],  [11]  and  [12] 